Xet Storage Not Deduplicating for Even Simple Binary Files

I have migrated to Xet storage, and today I tried to test whether Xet is really working.

My test is simple: generate an all-ones (int) array with NumPy and upload it to Hugging Face.

import numpy as np
a = np.ones(10000000, dtype=int)  # 10M ones; ~40 MB on disk with a 4-byte default int (Windows)
np.save("./one.npy", a)

And upload it:

pip install -U "huggingface_hub[cli,hf_xet]"
huggingface-cli.exe upload lyk/XetTest . --repo-type=dataset
Start hashing 1 files.
Finished hashing 1 files.
Uploading files using Xet Storage..

It shows that I am using Xet, but in the end the LFS storage reads 40MB, just as large as the raw file itself, so no deduplication.
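For what it’s worth, the per-file metadata exposed by the API only reports the logical file size, not what is physically stored after deduplication. A minimal sketch of that check, assuming a recent huggingface_hub:

from huggingface_hub import HfApi

api = HfApi()
# files_metadata=True fills in per-file LFS info (size, sha256)
info = api.repo_info("lyk/XetTest", repo_type="dataset", files_metadata=True)
for f in info.siblings:
    print(f.rfilename, f.size, f.lfs)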

Well, maybe it only deduplicates across commit history. So I generated a file twice as large:

import numpy as np
a = np.ones(20000000, dtype=int)  # twice as many ones; ~80 MB
np.save("./one.npy", a)

Then I uploaded it and got 120MB of LFS storage usage.

And during the whole process, the progress bar in the terminal showed that I uploaded the files in full (40MB and 80MB), even though Xet is enabled.

I don’t know why Xet does not work. Is anything wrong here?

I think it’s probably just a bug, but I’m not sure where to report it…

And something even stranger: no LFS file is removed after a super squash, even though all the history is removed. I have only seen this in my test repo; super squash works fine in my other repos.

Well, I did do more than just upload a bigger file: I uploaded one.npy, updated it, and then uploaded the original one.npy again. Maybe that is the reason.
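For reference, a super squash can also be run from the Python API; a minimal sketch, assuming a huggingface_hub version that ships HfApi.super_squash_history:

from huggingface_hub import HfApi

api = HfApi()
# Collapse the branch's entire commit history into a single commit;
# this rewrites history but does not by itself promise immediate storage reclaim
api.super_squash_history(repo_id="lyk/XetTest", repo_type="dataset")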

Xet Storage Not Deduplicating for Even Simple Binary Files · Issue #3090 · huggingface/huggingface_hub

Xet Storage Not Deduplicating for Even Simple Binary Files · Issue #343 · huggingface/xet-core

xet-team/README · Xet Storage Not Deduplicating for Even Simple Binary Files

I see.

From GitHub, rajatarya wrote:

The file size shown when uploading will reflect the total file size for the file, not the deduplicated file size. The experience of deduplication will be that the time necessary for completing the file upload is less, due to fewer bytes traveling on the wire.

Deduplication will occur across all files, not just across commit history.
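That chunk-level behaviour is easy to see on this exact test: an all-ones file is maximally repetitive, so almost every chunk hashes to the same digest. A toy illustration with fixed-size chunks (Xet itself uses content-defined chunking; the 64 KB chunk size here is arbitrary):

import hashlib

import numpy as np

def chunk_digests(data: bytes, chunk_size: int = 64 * 1024):
    # Hash fixed-size chunks; identical chunks share one digest,
    # so a content-addressed store keeps and ships them only once.
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

small = np.ones(10_000_000, dtype=np.int32).tobytes()  # ~40 MB of ones
large = np.ones(20_000_000, dtype=np.int32).tobytes()  # ~80 MB of ones

digests = chunk_digests(small) + chunk_digests(large)
print(len(digests), "chunks in total,", len(set(digests)), "unique")

Both files together collapse to a handful of unique chunks, which is why the upload finishes quickly even though the progress bar counts the full 40MB and 80MB.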

Does this mean that a frequently appended Parquet file will not be deduplicated at the block level?

I wonder.

I think Rajat answered the question of appending to Parquet files here, but just to reiterate: Yes, if you’re appending to a Parquet file and uploading it, only the new chunks will need to be transferred (other content will be deduplicated if it has already been uploaded).
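A minimal sketch of that append-and-reupload flow with pyarrow (the file name, repo name, and column are placeholders; whether the old row groups deduplicate byte-for-byte depends on the writer emitting them unchanged):

import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

# Parquet files cannot be appended to in place, so rewrite the file
# with the extra rows at the end.
table = pq.read_table("data.parquet")
new_rows = pa.table({"value": [4, 5, 6]})  # must match the existing schema
pq.write_table(pa.concat_tables([table, new_rows]), "data.parquet")

# Re-upload the same path: the progress bar still shows the full file
# size, but chunks already on the server are skipped on the wire.
HfApi().upload_file(
    path_or_fileobj="data.parquet",
    path_in_repo="data.parquet",
    repo_id="lyk/XetTest",
    repo_type="dataset",
)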
